| Member | Program | Student ID | Responsibilities |
|---|---|---|---|
| 陳品瑄 | Information Science, M.S. Year 1 | 112971018 | Data overview, R Markdown compilation |
| 林宴葶 | Information Science, M.S. Year 1 | 112971022 | Data cleaning and feature engineering |
| 傅國書 | Information Science, M.S. Year 1 | 112971025 | Model training and testing |
| 張祐誠 | Information Science, M.S. Year 1 | 112971013 | Analysis workflow and code architecture design, code integration, report writing |
Retention is vital to every company's Customer Relationship Management (CRM).
The retention rate reflects how many existing customers a company keeps while it acquires new ones; a business can only thrive in the long run if it retains its customers.
In this final project we use the Kaggle Telco Customer Churn dataset to identify customers who are likely to stop using the company's services,
so that retention marketing campaigns can be run before those customers churn.
In this project we implemented a model that detects, within a CRM context, which customers are about to churn.
We adopted XGBoost, a model with strong predictive power, for the predictions,
and evaluated it with ROC, AUC, lift analysis, and a null model analysis.
Based on the predictions, the company can target the customers most likely to churn with precision retention marketing.
Practical scenario
Predict churn behavior in order to retain customers.
Analyze all relevant customer data to develop targeted customer-retention programs.
Raw Data
Each row represents one customer.
Each column represents a customer attribute.
Detailed Information
Churn:
Customers who left within the last month.
Services each customer has signed up for:
Home phone service, multiple phone lines, internet service, add-on online security, online backup, device protection plan for internet equipment, premium tech support, streaming TV, and streaming movies.
Customer account information:
How long they have been a customer (tenure), contract, payment method, paperless billing, monthly charges, total charges, etc.
Customer demographics:
Gender, age range, and whether they live alone or with others.
| Column Name | Description | Remark |
|---|---|---|
| CustomerID | Unique ID identifying each customer | String |
| Gender | The customer's gender | Male, Female |
| Age | The customer's age (in years) as of the end of the fiscal quarter | Number |
| Senior Citizen | Whether the customer is 65 or older | Yes, No |
| Married (Partner) | Whether the customer is married | Yes, No |
| Dependents | Whether the customer lives with any dependents | Yes, No |
| Number of Dependents | The number of dependents living with the customer | Number |
| Phone Service | Whether the customer subscribes to home phone service with the company | Yes, No |
| Multiple Lines | Whether the customer subscribes to multiple phone lines with the company | Yes, No |
| Internet Service | Whether the customer subscribes to internet service with the company | No, DSL, Fiber Optic, Cable |
| Online Security | Whether the customer subscribes to the company's add-on online security service | Yes, No |
| Online Backup | Whether the customer subscribes to the company's add-on online backup service | Yes, No |
| Device Protection Plan | Whether the customer subscribes to the company's add-on protection plan for their internet equipment | Yes, No |
| Premium Tech Support | Whether the customer subscribes to the company's add-on tech support plan with reduced wait times | Yes, No |
| Streaming TV | Whether the customer uses their internet service to stream TV programs from a third-party provider | Yes, No |
| Streaming Movies | Whether the customer uses their internet service to stream movies from a third-party provider | Yes, No |
| Contract | The customer's current contract type | Month-to-Month, One Year, Two Year |
| Paperless Billing | Whether the customer has chosen paperless billing | Yes, No |
| Payment Method | How the customer pays their bill | Bank Withdrawal, Credit Card, Mailed Check |
| Monthly Charge | The customer's current total monthly charge for all services from the company | Number |
| Total Charges | The customer's total charges, calculated to the end of the quarter specified above | Number |
| Tenure | How long the customer has subscribed to the service | Number |
| Churn | Yes = the customer left the company this quarter; No = the customer remained | Yes, No |
Which customer characteristics are associated with churn?
Who will churn, and how accurately can we predict it?
introduce
intro <- introduce(dataset)
kable(intro)
| rows | columns | discrete_columns | continuous_columns | all_missing_columns | total_missing_values | complete_rows | total_observations | memory_usage |
|---|---|---|---|---|---|---|---|---|
| 7043 | 21 | 17 | 4 | 0 | 11 | 7032 | 147903 | 1641832 |
rows = 7043
columns = 21
discrete_columns = 17
continuous_columns = 4
all_missing_columns = 0
total_missing_values = 11
complete_rows = 7032
total_observations = 147,903
memory_usage = 1,641,832 (Bytes)
plot_intro
Basic description of the data:
columns (features): discrete columns, continuous columns, all-missing columns
complete rows (customers): the number of customers with complete records
missing observations: missing values
plot_intro(dataset)
plot_missing
Show only the columns that contain missing values.
plot_missing(dataset, missing_only = TRUE)
plot_bar
Discrete variables, colored by Churn.
plot_histogram
Continuous variables:
Monthly charges
Tenure (how long the customer has subscribed)
Total charges
plot_boxplot
plot_prcomp
The steps involved in data cleaning:
1. Check whether the data type of each variable is correct using the str() function.
2. Handle missing values:
2.1. Perform KNN (K-Nearest Neighbors) imputation specifically for the “TotalCharges” variable.
3. Standardize the data (convert text to a consistent form):
3.1. Conditionally transform values that start with “N” (e.g. “No phone service”, “No internet service”) and replace them with “No”.
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No" "No" "No" "No" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
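The three cleaning steps above can be sketched in R as follows. This is a minimal, self-contained sketch on a toy data frame; it assumes the VIM package for the kNN imputation (the report's actual implementation may use a different helper):

```r
# Toy stand-in for the raw dataset (hypothetical values)
df <- data.frame(
  MultipleLines = c("No phone service", "Yes", "No"),
  TotalCharges  = c(29.9, NA, 108.2),
  stringsAsFactors = FALSE
)

str(df)  # 1. check that each variable has the expected type

# 2.1 kNN imputation for TotalCharges (VIM is an assumed dependency, so guard it;
#     k = 1 here only because the toy frame is tiny)
if (requireNamespace("VIM", quietly = TRUE)) {
  df <- VIM::kNN(df, variable = "TotalCharges", k = 1, imp_var = FALSE)
}

# 3.1 collapse character values that start with "N"
#     (e.g. "No phone service", "No internet service") to plain "No"
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) ifelse(startsWith(x, "N"), "No", x))
```

After step 3.1, the str() output below shows the effect on the real data: MultipleLines values such as "No phone service" become plain "No".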
The steps involved in feature engineering:
1. Removing ‘TotalCharges’:
The removal of ‘TotalCharges’ from the model is justified by its high correlation with ‘MonthlyCharges’ and ‘tenure’.
Since ‘TotalCharges’ is essentially the product of ‘MonthlyCharges’ and ‘tenure’,
it provides little additional independent information, and keeping it would introduce redundancy and potential multicollinearity.
##
## Call:
## lm(formula = TotalCharges ~ MonthlyCharges + tenure, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1941.8 -468.6 -96.3 493.9 3474.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2152.3306 22.0491 -97.62 <2e-16 ***
## MonthlyCharges 35.8332 0.3016 118.80 <2e-16 ***
## tenure 65.2792 0.3696 176.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 737.9 on 7040 degrees of freedom
## Multiple R-squared: 0.894, Adjusted R-squared: 0.894
## F-statistic: 2.968e+04 on 2 and 7040 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: TotalCharges
## Df Sum Sq Mean Sq F value Pr(>F)
## MonthlyCharges 1 1.5334e+10 1.5334e+10 28164 < 2.2e-16 ***
## tenure 1 1.6988e+10 1.6988e+10 31201 < 2.2e-16 ***
## Residuals 7040 3.8331e+09 5.4447e+05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table indicates that both “MonthlyCharges” and “tenure” have a significant influence on “TotalCharges”.
The significance is determined by the associated p-values, which are extremely low here (< 2.2e-16).
A low p-value provides strong evidence against the null hypothesis,
indicating that the observed relationship between the predictors and the outcome is unlikely to be due to chance alone.
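The redundancy argument can be illustrated on synthetic data. The sketch below simulates the TotalCharges ≈ MonthlyCharges × tenure relationship (the numbers are simulated, not the report's):

```r
set.seed(1)
# Simulated customers: TotalCharges is (roughly) MonthlyCharges x tenure
demo <- data.frame(
  MonthlyCharges = runif(500, 20, 120),
  tenure         = sample(1:72, 500, replace = TRUE)
)
demo$TotalCharges <- demo$MonthlyCharges * demo$tenure + rnorm(500, sd = 50)

# A linear model on the two "parent" variables already explains most of the
# variance in TotalCharges -- the same pattern as the report's R-squared of 0.894
fit <- lm(TotalCharges ~ MonthlyCharges + tenure, data = demo)
summary(fit)$r.squared
```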
2. Encoding categorical variables:
Label encoding: converting categorical data to numbers where the order matters.
One-hot encoding: converting categorical data to a binary (0 or 1) format.
3. Scaling and normalization:
Standardizing the continuous variables (in the output below, tenure and MonthlyCharges become z-scores).
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No" "No" "No" "No" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## 'data.frame': 7043 obs. of 25 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : int 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : int 1 0 0 0 0 0 0 0 1 0 ...
## $ Dependents : int 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
## $ PhoneService : int 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLines : int 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetService : int 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : int 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackup : int 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtection : int 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupport : int 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTV : int 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMovies : int 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractMonth.to.month : int 1 0 1 0 1 1 1 1 1 0 ...
## $ ContractOne.year : int 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo.year : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBilling : int 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodBank.transfer..automatic.: int 0 0 0 1 0 0 0 0 0 1 ...
## $ PaymentMethodCredit.card..automatic. : int 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic.check : int 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed.check : int 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num -1.16 -0.26 -0.363 -0.746 0.197 ...
## $ Churn : int 0 0 1 0 1 1 0 0 1 0 ...
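The encoding and scaling transformations visible in the before/after str() output above can be sketched in base R on a toy frame (the column names and values here are illustrative, not the full pipeline):

```r
demo <- data.frame(
  Partner  = c("Yes", "No", "No"),
  Contract = c("Month-to-month", "One year", "Two year"),
  tenure   = c(1, 34, 2)
)

# Label encoding for binary Yes/No columns
demo$Partner <- ifelse(demo$Partner == "Yes", 1L, 0L)

# One-hot encoding; "~ 0 +" drops the intercept so every level gets its own column
onehot <- model.matrix(~ 0 + Contract, data = demo)

# Z-score standardization for continuous columns
demo$tenure <- as.numeric(scale(demo$tenure))

encoded <- cbind(demo[, c("Partner", "tenure")], onehot)
```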
library(readr)
library(ggplot2)
library(lattice)
library(caret)
library(xgboost)
library(pROC)
library(dplyr)
library(png)
data <- read_csv("../data/feature/feature_set.csv")
Separate the data into training and testing sets.
n <- nrow(data)
data <- data[sample(n),] # shuffle the rows randomly
index <- createDataPartition(data$Churn, p = 0.8, list = FALSE)
train_data <- data[index, ]
test_data <- data[-index, ]
Using the training set to train the XGBoost model.
Why choose the XGBoost model?
XGBoost (eXtreme Gradient Boosting) is a gradient-boosted tree model: it iteratively trains a sequence of decision trees,
each one fit to the residuals of the trees before it, and combines them to produce the final prediction.
XGBoost has the following advantages:
High accuracy: it handles large datasets and complex patterns well, so it usually achieves high predictive accuracy.
Strong regularization: it supports L1 and L2 regularization, which helps prevent overfitting and improves generalization ability.
Fast execution: compared with many other machine-learning algorithms, XGBoost usually runs fast, especially on large datasets.
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[-c(1, which(names(train_data) == "Churn"))]), label = train_data$Churn)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[-c(1, which(names(test_data) == "Churn"))]), label = test_data$Churn)
k-fold cross validation
objective: the model's objective function; binary:logistic performs binary classification with logistic regression as the objective.
eval_metric: the evaluation metric; auc measures the model's discriminative power on the binary classification problem.
eta: the learning rate, eta = 0.1.
max_depth: the maximum depth of each decision tree, which limits tree complexity to avoid overfitting, max_depth = 6.
nrounds: the number of training iterations, nrounds = 100.
nfold: the number of cross-validation folds, nfold = 5.
early_stopping_rounds: stop training early when the model shows no significant improvement for this many consecutive rounds, to avoid overfitting, early_stopping_rounds = 10.
verbose: the verbosity of the training log; verbose = 1 prints detailed progress for every round.
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
eta = 0.1,
max_depth = 6
)
nrounds <- 100
nfold <- 5
early_stopping_rounds <- 10
verbose <- 1
# using xgb.cv to do k-fold cross validation
cv_result <- xgb.cv(
params = params,
data = train_matrix,
nrounds = nrounds,
nfold = nfold,
early_stopping_rounds = early_stopping_rounds,
verbose = verbose
)
## [1] train-auc:0.860307+0.004076 test-auc:0.828004+0.006441
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.866141+0.002440 test-auc:0.833435+0.008350
## [3] train-auc:0.870326+0.001591 test-auc:0.832497+0.007450
## [4] train-auc:0.874660+0.001329 test-auc:0.835551+0.007451
## [5] train-auc:0.876655+0.001413 test-auc:0.835538+0.008015
## [6] train-auc:0.878598+0.001266 test-auc:0.836215+0.007620
## [7] train-auc:0.880688+0.001869 test-auc:0.836877+0.008243
## [8] train-auc:0.882262+0.001903 test-auc:0.837644+0.008654
## [9] train-auc:0.883513+0.001671 test-auc:0.838221+0.009416
## [10] train-auc:0.885013+0.001933 test-auc:0.838527+0.009261
## [11] train-auc:0.886721+0.001798 test-auc:0.838315+0.008828
## [12] train-auc:0.888365+0.001916 test-auc:0.838771+0.008729
## [13] train-auc:0.889695+0.001925 test-auc:0.839133+0.008536
## [14] train-auc:0.891292+0.001667 test-auc:0.839106+0.008027
## [15] train-auc:0.892566+0.001700 test-auc:0.839506+0.008265
## [16] train-auc:0.893961+0.001541 test-auc:0.839976+0.008245
## [17] train-auc:0.895306+0.001387 test-auc:0.839995+0.008201
## [18] train-auc:0.896708+0.001388 test-auc:0.840057+0.008032
## [19] train-auc:0.897940+0.001159 test-auc:0.840368+0.007561
## [20] train-auc:0.899059+0.001269 test-auc:0.840637+0.007114
## [21] train-auc:0.900143+0.001256 test-auc:0.840956+0.007209
## [22] train-auc:0.901163+0.001139 test-auc:0.841263+0.007475
## [23] train-auc:0.902195+0.001252 test-auc:0.841567+0.007375
## [24] train-auc:0.903078+0.001282 test-auc:0.841730+0.007139
## [25] train-auc:0.903918+0.001217 test-auc:0.841516+0.006927
## [26] train-auc:0.904833+0.001328 test-auc:0.841243+0.007072
## [27] train-auc:0.905554+0.001268 test-auc:0.841242+0.007132
## [28] train-auc:0.906447+0.001465 test-auc:0.841297+0.007179
## [29] train-auc:0.907339+0.001534 test-auc:0.841285+0.007280
## [30] train-auc:0.908167+0.001409 test-auc:0.841272+0.007404
## [31] train-auc:0.908796+0.001469 test-auc:0.841171+0.007492
## [32] train-auc:0.909617+0.001652 test-auc:0.841139+0.007663
## [33] train-auc:0.910516+0.001762 test-auc:0.841055+0.007581
## [34] train-auc:0.911145+0.001729 test-auc:0.841068+0.007440
## Stopping. Best iteration:
## [24] train-auc:0.903078+0.001282 test-auc:0.841730+0.007139
print(cv_result)
## ##### xgb.cv 5-folds
## iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## <num> <num> <num> <num> <num>
## 1 0.8603065 0.004076327 0.8280038 0.006440679
## 2 0.8661413 0.002440427 0.8334346 0.008350267
## 3 0.8703258 0.001590904 0.8324974 0.007450230
## 4 0.8746600 0.001329350 0.8355510 0.007451385
## 5 0.8766548 0.001413167 0.8355376 0.008014822
## 6 0.8785984 0.001266467 0.8362155 0.007619693
## 7 0.8806884 0.001868771 0.8368768 0.008243174
## 8 0.8822618 0.001902988 0.8376440 0.008654049
## 9 0.8835128 0.001670935 0.8382206 0.009416277
## 10 0.8850133 0.001932508 0.8385274 0.009260904
## 11 0.8867211 0.001797919 0.8383146 0.008828244
## 12 0.8883654 0.001915721 0.8387709 0.008729361
## 13 0.8896951 0.001924556 0.8391330 0.008536208
## 14 0.8912923 0.001666633 0.8391056 0.008027331
## 15 0.8925659 0.001699530 0.8395063 0.008265032
## 16 0.8939612 0.001540936 0.8399763 0.008244889
## 17 0.8953061 0.001387100 0.8399946 0.008200562
## 18 0.8967085 0.001388398 0.8400572 0.008031513
## 19 0.8979400 0.001159150 0.8403677 0.007560561
## 20 0.8990592 0.001269241 0.8406368 0.007114183
## 21 0.9001429 0.001255712 0.8409563 0.007208587
## 22 0.9011627 0.001139289 0.8412625 0.007474808
## 23 0.9021947 0.001251781 0.8415673 0.007375322
## 24 0.9030784 0.001281752 0.8417299 0.007139130
## 25 0.9039180 0.001216841 0.8415156 0.006927367
## 26 0.9048328 0.001327959 0.8412432 0.007072440
## 27 0.9055540 0.001268048 0.8412416 0.007131575
## 28 0.9064470 0.001465041 0.8412967 0.007178618
## 29 0.9073394 0.001534128 0.8412848 0.007279856
## 30 0.9081668 0.001408883 0.8412723 0.007404442
## 31 0.9087965 0.001468996 0.8411710 0.007492119
## 32 0.9096171 0.001652398 0.8411390 0.007663317
## 33 0.9105164 0.001761690 0.8410549 0.007580954
## 34 0.9111447 0.001729498 0.8410683 0.007439814
## iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## Best iteration:
## iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## <num> <num> <num> <num> <num>
## 24 0.9030784 0.001281752 0.8417299 0.00713913
High AUC: iteration 24 reaches the highest test AUC of the entire run.
Reasonable standard deviation: the relatively small standard deviation indicates that the model's performance is stable across the cross-validation folds.
By contrast, later iterations reach similar AUC scores but never significantly exceed the best iteration's,
and in some iterations the test AUC even drops.
For example, after the best iteration the test-auc declines.
Moreover, beyond the best iteration the training AUC (train-auc) keeps rising,
while the test AUC shows no clear improvement and even fluctuates, which suggests the model is starting to overfit the training data.
# Create DMatrix object
# data_matrix <- xgb.DMatrix(data = as.matrix(data[-c(1, which(names(data) == "Churn"))]), label = data$Churn)
# get the best iteration
best_nrounds <- cv_result$best_iteration
model <- xgb.train(
params = params,
data = train_matrix,
nrounds = best_nrounds,
verbose = TRUE
)
train_pred <- predict(model, train_matrix)
train_predicted_label <- ifelse(train_pred > 0.5, 1, 0)
test_pred <- predict(model, test_matrix)
test_predicted_label <- ifelse(test_pred > 0.5, 1, 0)
Use the model to predict on both the training and test sets, then convert the predicted probabilities into binary labels with a 0.5 threshold.
Compute the ROC of training and testing data
roc_train <- roc(train_data$Churn, train_pred)
roc_test <- roc(test_data$Churn, test_pred)
auc_train <- auc(roc_train)
auc_test <- auc(roc_test)
# Build the ROC curves as ggplot objects with ggroc
roc_train_plot <- ggroc(roc_train) + ggtitle("ROC Curve - Training Data")
roc_test_plot <- ggroc(roc_test) + ggtitle("ROC Curve - Testing Data")
# Save ROC curves
ggsave("roc_train.png", plot = roc_train_plot)
ggsave("roc_test.png", plot = roc_test_plot)
print(paste("AUC of training data: ", auc_train))
## [1] "AUC of training data: 0.895755315051635"
print(roc_train_plot)
print(paste("AUC of testing data: ", auc_test))
## [1] "AUC of testing data: 0.837689787447964"
print(roc_test_plot)
Null model prediction
Predict using the mean of Churn (the baseline churn rate).
# Null model prediction: using the mean of Churn in the training set as the probability
mean_churn <- mean(train_data$Churn)
null_train_pred <- rep(mean_churn, nrow(train_data))
null_test_pred <- rep(mean_churn, nrow(test_data))
# Compute the ROC of null model
roc_null_train <- roc(train_data$Churn, null_train_pred)
roc_null_test <- roc(test_data$Churn, null_test_pred)
auc_null_train <- auc(roc_null_train)
auc_null_test <- auc(roc_null_test)
# Print AUC values for the null model
print(paste("AUC for null model on training data: ", auc_null_train))
## [1] "AUC for null model on training data: 0.5"
print(paste("AUC for null model on testing data: ", auc_null_test))
## [1] "AUC for null model on testing data: 0.5"
# compare the two AUCs using DeLong's test
roc_test_xgb <- roc(test_data$Churn, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_test_null <- roc(test_data$Churn, null_test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# DeLong's test
delong_test <- roc.test(roc_test_xgb, roc_test_null, method="delong")
# print the test result
print(delong_test)
##
## DeLong's test for two correlated ROC curves
##
## data: roc_test_xgb and roc_test_null
## Z = 29.841, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## 0.3155105 0.3598691
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.8376898 0.5000000
DeLong's test is a statistical method for comparing the areas under two correlated ROC curves (AUCs).
Test statistic Z: the large Z value reflects a large difference between the two models' AUCs.
p-value: the extremely small p-value shows that the AUC difference is statistically significant (p < 0.05 is conventionally considered significant).
95% confidence interval of the AUC difference: we are 95% confident that the true AUC difference lies in this interval; since the interval does not contain 0, it further supports a significant difference.
In short, DeLong's test shows a statistically significant difference between the two ROC curves' AUCs:
the high Z value, the extremely small p-value, and the confidence interval all indicate that XGBoost significantly outperforms the null model.
XGBoost computes importance scores from each feature's contribution to reducing the loss across all trees.
These scores help identify which features contribute most to the model's predictive power.
importance_matrix <- xgb.importance(model = model, feature_names = colnames(train_data[-c(1, which(names(train_data) == "Churn"))]))
xgb.plot.importance(importance_matrix)
From the plot, ContractMonth.to.month has the highest importance score, making it the most important predictor.
1. Save the prediction result
train_results <- data.frame(customerID = train_data$customerID, label = train_predicted_label, probability = train_pred, groundtruth = train_data$Churn)
test_results <- data.frame(customerID = test_data$customerID, label = test_predicted_label, probability = test_pred, groundtruth = test_data$Churn)
write_csv(train_results, "train_predictions.csv")
write_csv(test_results, "test_predictions.csv")
head(test_results, 5)
## customerID label probability groundtruth
## 1 8361-LTMKD 1 0.5134749 1
## 2 2250-IVBWA 0 0.2028230 0
## 3 9605-WGJVW 1 0.8340625 0
## 4 3787-TRIAL 0 0.3369451 0
## 5 7733-UDMTP 0 0.4100148 0
2. Save the feature importance result
write_csv(importance_matrix, "feature_importance.csv")
head(importance_matrix, 5)
## Feature Gain Cover Frequency Importance
## <char> <num> <num> <num> <num>
## 1: ContractMonth.to.month 0.42235256 0.1734117 0.02118989 0.42235256
## 2: tenure 0.19345689 0.2084627 0.24042380 0.19345689
## 3: InternetService 0.13317884 0.1427151 0.04074980 0.13317884
## 4: MonthlyCharges 0.10192573 0.1608722 0.32355338 0.10192573
## 5: PaymentMethodElectronic.check 0.02165991 0.0642118 0.04645477 0.02165991
3. Save the XGBoost model
xgb.save(model, "../model/churn_prediction_model.xgb")
## [1] TRUE
4. Save the ROC curve
# Build the ROC curves as ggplot objects with ggroc
roc_train_plot <- ggroc(roc_train) + ggtitle("ROC Curve - Training Data")
roc_test_plot <- ggroc(roc_test) + ggtitle("ROC Curve - Testing Data")
# Save ROC curves
ggsave("roc_train.png", plot = roc_train_plot)
## Saving 7 x 5 in image
ggsave("roc_test.png", plot = roc_test_plot)
## Saving 7 x 5 in image
print(roc_train_plot)
print(roc_test_plot)
Lift analysis is an evaluation technique widely used in data science and machine learning, especially in marketing and recommender systems.
Its main purpose is to measure how much a strategy, campaign, or model improves on random selection.
Key points of lift analysis:
\(Lift\) is a ratio: the rate at which the target behavior occurs with the strategy applied, divided by the rate at which it occurs without it.
The formula is \[ Lift = \frac {P(B|A)}{P(B)}\]
where \(P(B|A)\) is the probability of \(B\) under condition \(A\), and \(P(B)\) is the unconditional probability of \(B\).
Lift analysis helps determine whether a particular action or model has a positive effect on the outcome, and whether that effect significantly exceeds what random chance would produce.
(Credit: ChatGPT)
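A tiny numeric illustration of the formula (hypothetical data, not from the report):

```r
# Ground truth for ten customers; the first four are the "targeted" group A
churned  <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
targeted <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

p_overall <- mean(churned)            # P(B)   = 0.4
p_given   <- mean(churned[targeted])  # P(B|A) = 0.75
lift <- p_given / p_overall           # 1.875: targeting beats random by ~1.9x
```

The decile-based code below applies the same idea: each decile's response rate is divided by the overall churn rate.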
# sorting by probability
test_results <- test_results[order(-test_results$probability),]
head(test_results, 10)
## customerID label probability groundtruth
## 164 4510-HIMLV 1 0.8717860 1
## 628 3716-BDVDB 1 0.8717860 1
## 820 9728-FTTVZ 1 0.8717860 1
## 1170 7254-IQWOZ 1 0.8717860 1
## 813 0679-IDSTG 1 0.8660697 1
## 12 0495-RVCBF 1 0.8644475 1
## 716 3068-OMWZA 1 0.8605848 1
## 1132 5597-GLBUC 1 0.8605848 1
## 709 5393-RXQSZ 1 0.8591195 1
## 315 9507-EXLTT 1 0.8573178 1
# Segment the customers into probability deciles
test_results$decile <- cut(test_results$probability, breaks=quantile(test_results$probability, probs=seq(0, 1, by = 0.1)), include.lowest=TRUE, labels=FALSE)
# Reverse the decile numbering so that decile 1 holds the highest probabilities
test_results$decile <- 11 - test_results$decile
head(test_results, 5)
## customerID label probability groundtruth decile
## 164 4510-HIMLV 1 0.8717860 1 1
## 628 3716-BDVDB 1 0.8717860 1 1
## 820 9728-FTTVZ 1 0.8717860 1 1
## 1170 7254-IQWOZ 1 0.8717860 1 1
## 813 0679-IDSTG 1 0.8660697 1 1
# compute each decile's actual response rate and lift
test_lift_df <- test_results %>%
group_by(decile) %>%
summarise(
count = n(),
num_responses = sum(label),
response_rate = mean(label),
lift = response_rate / mean(data$Churn)
)
plot <- ggplot(test_lift_df, aes(x = as.factor(decile), y = lift)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Lift Chart", x = "Decile", y = "Lift") +
theme_minimal()
# save the chart with ggsave
ggsave("lift_chart.png", plot, width = 10, height = 6, dpi = 300)
img <- readPNG("lift_chart.png")
print(plot)
# Save the lift data to CSV
write.csv(test_lift_df, "lift_data.csv", row.names = FALSE)
head(test_lift_df, 10)
## # A tibble: 10 × 5
## decile count num_responses response_rate lift
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 141 141 1 3.77
## 2 2 141 141 1 3.77
## 3 3 141 17 0.121 0.454
## 4 4 140 0 0 0
## 5 5 141 0 0 0
## 6 6 140 0 0 0
## 7 7 141 0 0 0
## 8 8 139 0 0 0
## 9 9 140 0 0 0
## 10 10 144 0 0 0